In [3]:
import coldatoms
import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt
import time
We evaluate the performance by carrying out a few force evaluations and measuring the elapsed time:
In [4]:
def measure_time(num_ptcls, per_ptcl_charges=False, num_iter=1, use_reference_impl=False):
    """Time num_iter Coulomb force evaluations for num_ptcls randomly placed particles."""
    ensemble = coldatoms.Ensemble(num_ptcls=num_ptcls)
    ensemble.x = np.random.random([num_ptcls, 3])
    if per_ptcl_charges:
        ensemble.set_particle_properties('charge', np.random.random(num_ptcls))
    else:
        ensemble.ensemble_properties['charge'] = 1.0
    f = coldatoms.CoulombForce()
    if use_reference_impl:
        f.use_reference_implementations()
    accumulated_force = np.zeros_like(ensemble.v)
    t0 = time.time()
    for i in range(num_iter):
        f.force(1.0e-1, ensemble, accumulated_force)
    t1 = time.time()
    return t1 - t0
We consider particle numbers on a logarithmic scale:
In [5]:
def num_ptcls(n_min, n_max, n):
    return [n_min * (n_max / n_min)**(i / (n - 1)) for i in range(n)]
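For example, num_ptcls(1, 1000, 4) yields approximately [1, 10, 100, 1000]. The same grid could also be generated with np.logspace (shown here purely for illustration):

np.logspace(np.log10(1), np.log10(1000), 4)  # -> approximately array([1., 10., 100., 1000.])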
Here is the performance of the naive C implementation (for comparison we also show the performance obtained on an AWS EC2 C4 instance).
In [9]:
nptcls = np.array(num_ptcls(1, 1000, 30))
times = np.array([measure_time(int(round(n)), num_iter=5) for n in nptcls])
pairs_per_second = nptcls**2 / times
c4_nptcls = np.array([ 1.00000000e+00, 1.37382380e+00, 1.88739182e+00, 2.59294380e+00,
3.56224789e+00, 4.89390092e+00, 6.72335754e+00, 9.23670857e+00,
1.26896100e+01, 1.74332882e+01, 2.39502662e+01, 3.29034456e+01,
4.52035366e+01, 6.21016942e+01, 8.53167852e+01, 1.17210230e+02,
1.61026203e+02, 2.21221629e+02, 3.03919538e+02, 4.17531894e+02,
5.73615251e+02, 7.88046282e+02, 1.08263673e+03, 1.48735211e+03,
2.04335972e+03, 2.80721620e+03, 3.85662042e+03, 5.29831691e+03,
7.27895384e+03, 1.00000000e+04])
c4_pairs_per_second = np.array([ 8.21896849e+01, 8.99578985e+04, 1.69785802e+05, 3.20452334e+05,
6.04819115e+05, 1.04640310e+06, 1.97497265e+06, 3.40804318e+06,
5.36026069e+06, 8.44192672e+06, 1.04605076e+07, 1.35549480e+07,
1.54981408e+07, 1.70811388e+07, 1.48204588e+07, 1.83218908e+07,
1.87899007e+07, 1.81859734e+07, 1.84545152e+07, 1.85655733e+07,
1.86642306e+07, 1.86659059e+07, 1.87018426e+07, 1.87843680e+07,
1.87249206e+07, 1.87188841e+07, 1.86915702e+07, 1.86896431e+07,
1.86820557e+07, 1.87087450e+07])
c4_pairs_per_second_opt = np.array([ 7.77428407e+01, 8.60466855e+04, 1.77870840e+05, 3.35711969e+05,
5.78522632e+05, 1.14153065e+06, 2.06084102e+06, 3.68911890e+06,
5.77258843e+06, 9.17072615e+06, 1.24658899e+07, 1.54452639e+07,
1.81193908e+07, 1.97748636e+07, 1.66558348e+07, 2.03540609e+07,
2.14466467e+07, 2.16752990e+07, 2.15554241e+07, 2.16672624e+07,
2.17012696e+07, 2.17274807e+07, 2.16995063e+07, 2.17516952e+07,
2.17600542e+07, 2.17549355e+07, 2.17497805e+07, 2.17769467e+07,
2.17739088e+07, 2.17851869e+07])
plt.figure()
plt.loglog(nptcls, pairs_per_second);
plt.loglog(c4_nptcls, c4_pairs_per_second);
plt.loglog(c4_nptcls, c4_pairs_per_second_opt);
The latency can be inferred from the time it takes to deal with just one pair:
In [7]:
times[0]
Out[7]:
In the limit of large numbers of particles we process on the order of $6\times 10^6$ particle pairs per second:
In [8]:
pairs_per_second[-1]
Out[8]:
For the numbers of particles considered here we do not observe any cache effects yet. This could be due to inefficiencies in the force evaluation function or due to the relatively small number of particles (the particle data easily fits into the L2 cache and almost fits into the L1 cache). We can model the processing speed in terms of a fixed latency and the asymptotic processing rate at large numbers of particles.
In [9]:
def const_rate_model(latency, rate, num_ptcls):
    num_pairs = num_ptcls**2
    total_time = latency + num_pairs / rate
    return num_pairs / total_time
In [15]:
plt.figure()
plt.loglog(nptcls, pairs_per_second)
plt.loglog(nptcls, [const_rate_model(5.0e-5, 8.0e6, n) for n in nptcls]);
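The latency of 5.0e-5 s and the rate of 8.0e6 pairs per second used in the plot above were chosen by hand. Alternatively, the two parameters could be fitted to the measured throughput; here is a minimal sketch assuming scipy is available (latency_fit and rate_fit are illustrative names, and the fitted values depend on the machine):

from scipy.optimize import curve_fit

def model(n, latency, rate):
    return const_rate_model(latency, rate, n)

# Fit latency and asymptotic pair rate to the measured throughput, starting
# from the hand-picked values used in the plot above.
popt, _ = curve_fit(model, nptcls, pairs_per_second, p0=[5.0e-5, 8.0e6])
latency_fit, rate_fit = popt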
In the following we compare a few different implementations.
Some of the timing data required recompilation of the coldatoms C extension and thus a restart of the Python kernel. We therefore inline the timing data here. All timing data uses the following numbers of particles:
In [16]:
nptcls = np.array([ 1. , 1.268961 , 1.61026203, 2.04335972,
2.5929438 , 3.29034456, 4.17531894, 5.29831691,
6.72335754, 8.53167852, 10.82636734, 13.73823796,
17.43328822, 22.12216291, 28.07216204, 35.6224789 ,
45.20353656, 57.3615251 , 72.78953844, 92.36708572,
117.21022975, 148.73521073, 188.73918221, 239.502662 ,
303.91953823, 385.66204212, 489.39009185, 621.01694189,
788.04628157, 1000. ])
The reference implementation has this performance:
In [17]:
pairs_per_second_reference = np.array([ 3754.97224709, 10422.72910995, 6161.80993837, 10058.90690229,
7977.31411846, 12252.85370598, 7772.16310828, 10301.24072883,
8244.43945831, 9103.42166074, 9766.48063046, 9611.93684861,
10848.58925705, 10003.22038508, 10536.11487913, 10502.85977021,
13790.40041135, 14756.04096312, 13686.40446465, 14516.38360699,
13543.29197737, 13759.00597281, 13842.27083136, 13488.36978563,
12883.47362135, 12343.43336072, 11535.69300621, 11728.47328488,
11188.22477577, 8771.32862753])
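The quantity computed in all of these benchmarks is the pairwise Coulomb force $F_i = k q_i \sum_{j \neq i} q_j (x_i - x_j)/|x_i - x_j|^3$. For orientation, the following is a minimal NumPy sketch of such an $O(N^2)$ evaluation; it is not the actual coldatoms reference code, and the Coulomb constant as well as the treatment of the dt argument are assumptions:

def coulomb_force_sketch(dt, x, q, k=8.9875517873681764e9):
    """Illustrative O(N^2) pairwise Coulomb force; returns dt * F for each particle."""
    f = np.zeros_like(x)
    for i in range(x.shape[0]):
        r = x[i] - x                          # displacement vectors x_i - x_j
        d = np.linalg.norm(r, axis=1)
        d[i] = np.inf                         # exclude the self-interaction
        f[i] = k * q[i] * ((q / d**3)[:, np.newaxis] * r).sum(axis=0)
    return dt * f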
Here is the performance of a direct translation of the reference implementation to C:
In [18]:
pairs_per_second_naive_c = np.array([ 10699.75510204, 30839.85584258, 55206.0638283 ,
90738.63706541, 136230.94427643, 221507.68668544,
363783.63504059, 580015.67337796, 857906.67158507,
1327397.61712042, 1862179.30155215, 2692617.37091628,
3417509.20801424, 3759433.7356532 , 5912890.28819334,
6210511.33097665, 6165807.07674836, 6578029.24543723,
6245854.91663751, 7587882.39220302, 7396963.5969694 ,
7803134.84028501, 8355880.86492011, 8627377.42296997,
8725380.89446372, 8792556.68439878, 8841519.959524 ,
8266728.56837714, 6405629.27527453, 7742010.74647583])
Here is a C implementation where the outer and inner particle loops in the pairwise force are chunked to enable vectorization and caching (a rough sketch of the chunking pattern follows the timing data below). This version was compiled with -msse4 and thus uses SSE4 vectorization. It is currently the default implementation used in the coldatoms library.
In [19]:
pairs_per_second_chunked = np.array([ 4.40592035e+01, 3.26276736e+04, 5.57722799e+04,
8.88962282e+04, 1.41707565e+05, 2.25915800e+05,
3.63783635e+05, 5.63364506e+05, 8.46416850e+05,
1.31595453e+06, 1.99843632e+06, 2.58702453e+06,
3.70561318e+06, 4.41430284e+06, 5.56448766e+06,
4.47261194e+06, 6.02281928e+06, 6.84558797e+06,
6.13549194e+06, 6.95519016e+06, 6.78948349e+06,
6.53108161e+06, 6.84557435e+06, 6.45104368e+06,
6.50288098e+06, 6.45530515e+06, 5.69232280e+06,
4.99511738e+06, 7.79344758e+06, 7.58281281e+06])
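To illustrate what chunking the particle loops means, here is a Python sketch of the loop-blocking pattern; the actual coldatoms implementation is written in C, and CHUNK and pair_kernel are hypothetical names:

CHUNK = 64  # illustrative block size, chosen so a block of positions fits in cache

def chunked_pair_loop(x, pair_kernel):
    n = x.shape[0]
    for i0 in range(0, n, CHUNK):
        xi = x[i0:i0 + CHUNK]        # outer block of positions, reused across all inner blocks
        for j0 in range(0, n, CHUNK):
            xj = x[j0:j0 + CHUNK]    # inner block of positions
            pair_kernel(xi, xj)      # process a CHUNK x CHUNK tile of pair interactions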
The following version was compiled with -mavx; AVX is the most advanced vector instruction set available on the laptop used for these timing studies.
In [20]:
pairs_per_second_chunked_avx = np.array([ 12409.18343195, 37943.4181434 , 62146.25470958,
100071.75402071, 168861.11057018, 267112.2104148 ,
395246.00347654, 684553.38195189, 1030420.5131538 ,
1503948.03910195, 2204553.07448326, 3166518.02819754,
4652302.68098289, 5654685.45362713, 7002766.25233067,
6867623.49760902, 8232922.03331433, 9133491.30173879,
8718223.53605189, 8283438.27815497, 7369528.89377051,
7376934.04244149, 8322369.84045209, 7516375.83786946,
7549459.96638704, 7623711.51199181, 7380784.94405883,
6349442.00772738, 6432029.20628165, 8554706.17509566])
The following implementation uses OpenMP to parallelize the outer loop over particles. The laptop used for these timing studies has 2 cores; therefore only a modest acceleration can be achieved. For large numbers of particles we expect a nearly ideal speedup of 2x.
In [35]:
pairs_per_second_chunked_avx_openmp = np.array([ 1.43566798e+02, 1.78481764e+02, 3.11719871e+02,
4.78615932e+02, 7.87616061e+02, 1.12574251e+03,
1.99908441e+03, 3.10250538e+03, 5.52021704e+03,
5.58577039e+03, 1.43348983e+04, 2.15350791e+04,
3.36393871e+04, 2.45797009e+04, 5.68420037e+04,
5.04015929e+06, 7.26927213e+06, 8.20006260e+06,
9.90760223e+06, 9.57827981e+06, 8.98103903e+06,
1.04068053e+07, 1.03363200e+07, 3.15302633e+06,
2.49586490e+06, 7.32430894e+06, 6.46459903e+06,
7.77060651e+06, 1.17015150e+07, 1.19050503e+07])
In [34]:
plt.figure()
plt.loglog(nptcls, pairs_per_second_reference, '.-')
plt.loglog(nptcls, pairs_per_second_naive_c, 's-', ms=5)
plt.loglog(nptcls, pairs_per_second_chunked, 'd-', ms=5)
plt.loglog(nptcls, pairs_per_second_chunked_avx, '^-', ms=5)
plt.loglog(nptcls, pairs_per_second_chunked_avx_openmp, 'v-', ms=5)
plt.xlabel('nptcls')
plt.ylabel('pairs / s');
Here are some observations about the relative and absolute performance of the different implementations:
In [ ]: